DC Airbnb Analysis

Import the data

library(tidyverse)
airbnb_df <- read_csv("listings_2.csv")

Clean the data

airbnb_df[,c("host_response_rate", 
             "bathrooms",
             "weekly_price",
             "monthly_price",
             "cleaning_fee",
             "security_deposit",
             "guests_included",
             "extra_people",
             "review_scores_rating")] <- NULL

Some categorical variables of interest need to be converted to factors.

airbnb_df$neighbourhood_cleansed <- as.factor(airbnb_df$neighbourhood_cleansed)
airbnb_df$neighbourhood <- as.factor(airbnb_df$neighbourhood)
airbnb_df$property_type <- as.factor(airbnb_df$property_type)
airbnb_df$room_type <- as.factor(airbnb_df$room_type)
airbnb_df$bed_type <- as.factor(airbnb_df$bed_type)
airbnb_df$cancellation_policy <- as.factor(airbnb_df$cancellation_policy)

1. Describe your dataset. Give the source of the dataset and a metadata listing for each variable.

Source: Detailed Listings data for Washington, D.C. from Inside Airbnb (http://insideairbnb.com/get-the-data.html)

Variable                 Type     Description
host_id                  num      Host identification number
host_name                char     Name of host
neighbourhood_cleansed   factor   Property’s neighborhood group
neighbourhood            factor   Property’s neighborhood
zipcode                  char     Property’s zipcode
latitude                 num      Latitude coordinate of property
longitude                num      Longitude coordinate of property
property_type            factor   Type of property
room_type                factor   Type of room
accommodates             num      Number of people the property can accommodate
bedrooms                 num      Number of available bedrooms
beds                     num      Number of available beds
bed_type                 factor   Type of bed
price                    num      Listing price
minimum_nights           num      Minimum number of nights per stay
availability_365         num      Property’s availability in the next 365 days
availability_30          num      Property’s availability in the next 30 days
availability_60          num      Property’s availability in the next 60 days
availability_90          num      Property’s availability in the next 90 days
reviews_per_month        num      Number of reviews per month
cancellation_policy      factor   Cancellation policy

2. Read in your dataset and calculate

a. The number of missing values in your dataset

b. The percentage of missing values in your dataset.

## # A tibble: 1 x 1
##   count
##   <int>
## 1  2000
## [1] 1.040626

R recognizes 2,000 values in this dataset as NA, which is 1.04% of all cells (rows × columns).
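The code behind the two numbers above is not shown in the report. A minimal base R sketch of one way to obtain them follows; `toy_df` is a small hypothetical data frame standing in for `airbnb_df`, and the original may well have used a tidyverse pipeline instead.

```r
# Hypothetical toy data frame standing in for airbnb_df.
toy_df <- data.frame(
  price     = c(100, NA, 250, 80),
  beds      = c(1, 2, NA, NA),
  room_type = c("Entire home/apt", "Private room", NA, "Shared room")
)

# a. Total number of cells that R recognizes as NA.
n_missing <- sum(is.na(toy_df))

# b. Missing cells as a percentage of all cells (rows x columns).
pct_missing <- 100 * n_missing / (nrow(toy_df) * ncol(toy_df))

n_missing
pct_missing
```

Note that only true `NA` values are counted this way; empty strings or placeholder text would have to be recoded to `NA` first.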

3. Give TWO questions about your dataset that you are going to investigate.

Question 1: What are the most common Airbnb properties in D.C.? What is the variation in price for different types of property?

Question 2: How does location influence property rental price?

4. Perform EDA to answer your research questions.

Question 1. What are the most common Airbnb properties in D.C.? What is the variation in price for different types of property?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    79.0   115.0   193.1   187.2 10000.0
## [1] 304.5666

  • $115 is the median listing price for properties on Airbnb in this data sample; the mean is $193.10, pulled upward by a small number of very expensive listings.
  • Minimum price: $10
  • Maximum price: $10,000
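A short sketch of how a summary like the one above could be produced. The price vector here is hypothetical; the real analysis would presumably pass `airbnb_df$price`. The single number printed after the summary in the report looks like a standard deviation, but that is an assumption since the code is not shown.

```r
# Hypothetical nightly prices; one extreme value mimics the $10,000 listing.
toy_price <- c(10, 79, 115, 150, 10000)

price_summary <- summary(toy_price)  # min, quartiles, median, mean, max
price_sd      <- sd(toy_price)       # sample standard deviation

price_summary
price_sd

# The mean sits far above the median because of the single extreme value,
# which is why the median is the safer "typical price" for skewed data.
mean(toy_price) > median(toy_price)
```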

## # A tibble: 3 x 8
##   property_type average_price min_price max_price   std min_night accom
##   <fct>                 <dbl>     <dbl>     <dbl> <dbl>     <dbl> <dbl>
## 1 Hostel                 45.6        33       150  31.5         1     6
## 2 Bungalow               92          30       325  70.5         1     3
## 3 Guest suite           101.         30      1500  74.2         1     3
## # … with 1 more variable: range <dbl>
## # A tibble: 3 x 8
##   property_type average_price min_price max_price   std min_night accom
##   <fct>                 <dbl>     <dbl>     <dbl> <dbl>     <dbl> <dbl>
## 1 Serviced apa…          266.       100       569  101.         1     5
## 2 Dome house             300        300       300   NA          3    10
## 3 Resort                 500        500       500   NA          3     4
## # … with 1 more variable: range <dbl>

  • Hostel has the lowest mean price of all property types, followed by Bungalow and Guest suite.
  • Dome house and Resort have the highest mean prices. Their standard deviation is undefined (NA) because there is only one listing recorded for each of these property types.
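The per-type statistics above can be reproduced in several ways. The sketch below uses base R `tapply()` on a hypothetical `toy_df`; the tibble output in the report suggests the original used dplyr, so this is an illustration rather than the author's code. It also shows why a property type with a single listing gets `NA` for its standard deviation.

```r
# Hypothetical listings; "Resort" deliberately has a single entry.
toy_df <- data.frame(
  property_type = c("Hostel", "Hostel", "House", "House", "House", "Resort"),
  price         = c(40, 50, 100, 200, 300, 500)
)

# One statistic per property type.
avg <- tapply(toy_df$price, toy_df$property_type, mean)
std <- tapply(toy_df$price, toy_df$property_type, sd)
n   <- tapply(toy_df$price, toy_df$property_type, length)

by_type <- data.frame(
  property_type = names(avg),
  average_price = as.vector(avg),
  std           = as.vector(std),
  n             = as.vector(n)
)

# Cheapest types first; sd() of a single value is NA, as for Resort here.
by_type <- by_type[order(by_type$average_price), ]
by_type
```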

From the visualization, apartment is the most common property type on Airbnb. The frequency table of property types (in percent of listings) confirms this:

##                  Var1       Freq
## 1           Apartment 46.0008741
## 2               House 21.2084790
## 3           Townhouse 15.4173951
## 4         Condominium  8.2604895
## 5         Guest suite  5.6927448
## 6  Serviced apartment  0.6555944
## 7                Loft  0.6446678
## 8   Bed and breakfast  0.6228147
## 9          Guesthouse  0.5135490
## 10              Other  0.2513112
  • Apartment: 46.0% of listings in this data sample
  • House: 21.2% of listings
  • Townhouse: 15.4% of listings
  • Condominium: 8.3% of listings
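A percentage frequency table like the one above can be built with `table()` and `prop.table()`. This is a sketch on a hypothetical vector of property types, not the report's hidden code.

```r
# Hypothetical property types for ten listings.
toy_types <- c(rep("Apartment", 5), rep("House", 3), rep("Townhouse", 2))

# Counts -> proportions -> percentages.
freq <- prop.table(table(property_type = toy_types)) * 100

# Convert to a data frame and sort from most to least common.
freq_df <- as.data.frame(freq)
freq_df <- freq_df[order(-freq_df$Freq), ]
freq_df
```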

  • Resort has the highest median price, but there is only one listing of this property type.
  • House has the most outliers and the most variation in price compared to other property types.
  • Hostel is the cheapest Airbnb property type.
  • Townhouse and Condominium appear to have similar price ranges, but Condominium is slightly cheaper.

Question 2. How does location influence property rental price?

  • The Georgetown area has the highest standard deviation in listing price.
  • The Sheridan, Barry Farm, Buena Vista neighborhood group has the smallest dispersion in price.

Boxplots give further insight into the price distribution of each neighborhood:

  • The West End, Foggy Bottom, GWU neighborhood group has the most variation in price, followed by Southwest/Waterfront.
  • The Downtown, Chinatown, Penn Quarters area has the highest median price.
  • The Eastland Gardens, Kenilworth area has the lowest median price.
  • Columbia Heights-Mt.Pleasant and Cathedral Heights appear to have similar price ranges, with Cathedral Heights having a slightly lower median price.
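The plots themselves are not included here. As a rough illustration of how medians and outliers per neighborhood can be read off a boxplot, the sketch below uses base R `boxplot()` on hypothetical data; the neighborhood labels "A" and "B" are placeholders, and the report's own figures were probably drawn with ggplot2.

```r
# Hypothetical prices for two placeholder neighborhoods.
toy_df <- data.frame(
  neighbourhood = rep(c("A", "B"), each = 5),
  price         = c(50, 60, 70, 80, 1000, 100, 110, 120, 130, 140)
)

# Draw one box per neighborhood; boxplot() also returns the plotted statistics.
bp <- boxplot(price ~ neighbourhood, data = toy_df, ylab = "Price (USD)")

# Row 3 of bp$stats holds the median of each group.
medians <- setNames(bp$stats[3, ], bp$names)

# bp$group gives the group index of every point drawn as an outlier.
n_outliers <- table(factor(bp$group,
                           levels = seq_along(bp$names),
                           labels = bp$names))

medians
n_outliers
```

In this toy example group "A" has the lower median but the only outlier, which mirrors how a single expensive listing can dominate a neighborhood's spread.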

5. Write a two paragraph summary about what your EDA is telling you about your data.

The median Airbnb listing in D.C. costs $115 per night, while the mean is $193 because a few very expensive listings pull it upward. The most expensive accommodation costs $10,000 with a minimum stay of four nights, and the cheapest option costs only $10 per night. The three most affordable property types are hostel ($46/night), bungalow ($92/night), and guest suite ($101/night), with standard deviations of $32, $70, and $74 respectively. On the other hand, resort is the most expensive property type, with an undefined standard deviation since there is only one listing of this type. The most common property type in D.C., which also has a fairly low price, is the apartment, followed by houses, townhouses, and condominiums. Apartments account for 46% of all listings, whereas townhouses account for only 15%; hostels are too rare to appear in the ten-row frequency table. Looking at the “Price Variation of Property Type” graph, the house category has the greatest variation in price and contains the highest price outlier.

To compare Airbnb price ranges across D.C. neighborhoods, we first look at the standard deviation visualization. The Georgetown, Burleith-Hillandale area has the greatest dispersion in price. This is due to an outlier: the $10,000 Historic Georgetown Residence. In contrast, the Sheridan, Barry Farm, Buena Vista neighborhood has the lowest standard deviation in price. Furthermore, the Downtown, Chinatown, Penn Quarters area has the highest median price. Columbia Heights-Mt.Pleasant and Cathedral Heights appear to have similar price ranges, with Cathedral Heights having a slightly lower median price.


In case you are bored, here is an interactive map with detailed listings of Airbnb properties in D.C.

6. Give a third question about your dataset that you want to investigate using a two-sample t-test.

Question: Do townhouses have a higher average price compared to condominiums?

A two-sample t-test will be performed at the 95% confidence level (\(\alpha = 0.05\)).

Null hypothesis: The average price of townhouses is equal to that of condos. \(H_{0}: \mu_{T} = \mu_{C}\)

Alternative hypothesis: The average price of townhouses is higher than that of condos. \(H_{a}: \mu_{T} > \mu_{C}\)

## 
##  Welch Two Sample t-test
## 
## data:  townhouse$price and condo$price
## t = 0.94315, df = 1516.4, p-value = 0.1729
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -6.361168       Inf
## sample estimates:
## mean of x mean of y 
##  180.9809  172.4431

P-value: 0.1729 > \(0.05 = \alpha\)

Conclusion: Fail to reject null hypothesis.

Real-world interpretation: The difference between the two sample means is not statistically significant. There is not enough evidence in our data to conclude that the average price of townhouses is higher than that of condominiums. The price difference we observed in the visualization is plausibly due to chance.

A 95% confidence level means that, under repeated sampling, about 95% of intervals constructed this way would contain the true difference between the means of the two groups. Because the alternative hypothesis is one-sided, the interval is one-sided as well and runs from -6.36 to \(\infty\). This interval contains 0, which implies that 0 is a reasonable possibility for the true value of the difference. Hence, we fail to reject the null hypothesis.
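The output above has the shape produced by `t.test()` with a one-sided alternative. A minimal sketch on hypothetical price vectors (the report's `townhouse$price` and `condo$price` are not reproduced here):

```r
# Hypothetical nightly prices for the two groups.
townhouse_price <- c(120, 150, 180, 200, 260, 175)
condo_price     <- c(110, 160, 170, 190, 240, 165)

# var.equal = FALSE is the default, so this is Welch's test.
# alternative = "greater" tests H_a: mean(townhouse) > mean(condo).
res <- t.test(townhouse_price, condo_price,
              alternative = "greater", conf.level = 0.95)

res$p.value    # compare against alpha = 0.05
res$conf.int   # one-sided interval: lower bound to Inf

reject_null <- res$p.value < 0.05
reject_null
```

With a one-sided alternative, the decision based on the p-value and the decision based on whether the interval's lower bound exceeds 0 always agree.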

7. Give a fourth question about your dataset that you can investigate using a Chi-Square test.

Question: Is there a relationship between room type and bed type?

We’ll perform a Chi-square test with \(\alpha = 0.05\).

Null hypothesis: \(H_{0}:\) Room type and bed type are independent.

Alternative hypothesis: \(H_{a}:\) Room type and bed type are dependent.

## Warning in chisq.test(table_room_bed): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table_room_bed
## X-squared = 90.777, df = 12, p-value = 3.491e-14

P-value: 3.491E-14 < \(0.05 = \alpha\).

Conclusion: Reject null hypothesis.

Real-world interpretation: There is enough evidence to show that room type and bed type are related. However, R warns that the chi-squared approximation may be incorrect, so this result should be treated with caution.
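A sketch of the test on hypothetical data, assuming the contingency table was built with `table()` as the name `table_room_bed` in the output suggests. The counts below are invented so that every expected count is at least 5 and no warning is raised.

```r
# Hypothetical listings: 60 entire homes and 40 private rooms.
toy_df <- data.frame(
  room_type = c(rep("Entire home/apt", 60), rep("Private room", 40)),
  bed_type  = c(rep("Real Bed", 45), rep("Futon", 8),  rep("Couch", 7),
                rep("Real Bed", 20), rep("Futon", 10), rep("Couch", 10))
)

# Cross-tabulate the two categorical variables.
table_room_bed <- table(toy_df$room_type, toy_df$bed_type)

res <- chisq.test(table_room_bed)
res$statistic   # X-squared
res$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value

# Expected counts under independence; values below 5 trigger the warning.
res$expected
```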

8. Give a fifth question about your dataset that involves the covariation of two quantitative variables.

Question: Assuming I’m a hostel owner, I would like to predict the price depending on the number of beds I have. How does the price per night relate to the number of beds?

Null hypothesis: \(H_{0}:\) There is no correlation between the number of beds and the price.

Alternative hypothesis: \(H_{a}:\) There is a correlation between the number of beds and the price.

## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = airbnb_df$price ~ airbnb_df$beds)
## 
## Coefficients:
##    (Intercept)  airbnb_df$beds  
##          95.96           50.52
## 
## Call:
## lm(formula = airbnb_df$price ~ airbnb_df$beds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2572.1   -96.5   -59.5    -6.0  9651.4 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      95.964      4.942   19.42   <2e-16 ***
## airbnb_df$beds   50.523      2.019   25.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 293.8 on 9137 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.06411,    Adjusted R-squared:  0.06401 
## F-statistic: 625.9 on 1 and 9137 DF,  p-value: < 2.2e-16

The linear model is: \[\text{price} = 50.52 \times \text{number of beds} + 95.96\]

P-value: < 2.2E-16, which is below \(\alpha = 0.05\)

The slope is statistically significant, so we reject the null hypothesis: there is a relationship between the number of beds and the price.

## [1] 0.2532029

With \(r^2 = 0.0641\) (the multiple R-squared in the summary above), we understand that about 6.4% of the variation in the price is explained by the number of beds.

\(r = 0.2532\) indicates a weak positive correlation between the two variables.
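The relationship between the slope, \(r\), and \(r^2\) can be checked directly. The sketch below fits the same kind of model on hypothetical data. It uses the `data =` form of `lm()` rather than the `airbnb_df$price ~ airbnb_df$beds` form in the output above, a deliberate change that makes `predict()` work with new data.

```r
# Hypothetical listings; one row has a missing bed count.
toy_df <- data.frame(
  beds  = c(1, 2, 3, 4, NA, 2),
  price = c(150, 190, 260, 300, 120, 210)
)

# Simple linear regression; rows with NA are dropped by default.
model <- lm(price ~ beds, data = toy_df)
coef(model)   # intercept and slope

# Pearson correlation on complete cases only.
r <- cor(toy_df$beds, toy_df$price, use = "complete.obs")

# For a one-predictor model, r squared equals the multiple R-squared.
r_squared <- summary(model)$r.squared
all.equal(r^2, r_squared)

# Predicted nightly price for a hypothetical 3-bed listing.
predict(model, newdata = data.frame(beds = 3))
```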

9. Write a two-paragraph summary of any ethical concerns about your dataset and/or project.

There are some limitations I encountered while analyzing the dataset. The Airbnb data from Inside Airbnb was last retrieved on November 22, 2019, so the analysis is based solely on what was scraped from the Airbnb website on that date. In addition, historical data for the property prices are not available in this dataset.

While performing the chi-squared test for the two categorical variables room_type and bed_type, the test returned the warning “Chi-squared approximation may be incorrect”. This warning indicates that some expected cell counts in the contingency table are small; hence, the approximation may be poor. The null hypothesis was rejected because the p-value is far below alpha, but the conclusion should be read with this caveat in mind.
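One possible follow-up to this warning, which was not part of the original analysis, is to inspect the expected counts and rerun the test with a Monte Carlo p-value through the `simulate.p.value` argument of `chisq.test()`. The table below is hypothetical and deliberately sparse.

```r
# Hypothetical sparse contingency table (rare bed types).
table_small <- matrix(
  c(12, 1, 0,
     9, 2, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(room_type = c("Entire home/apt", "Private room"),
                  bed_type  = c("Real Bed", "Futon", "Couch"))
)

# Expected counts under independence; the warning is suppressed on purpose
# because we only want to look at the counts here.
expected <- suppressWarnings(chisq.test(table_small))$expected
any(expected < 5)   # TRUE means the chi-squared approximation is doubtful

# Monte Carlo p-value: does not rely on the large-sample approximation.
set.seed(42)
res_sim <- chisq.test(table_small, simulate.p.value = TRUE, B = 10000)
res_sim$p.value
```

If the simulated p-value leads to the same decision as the asymptotic one, the original conclusion is on firmer ground.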